Conversation
Update from original
…prakbanken_swe and removed deprecated commands from run.sh
…h python 2 and 3 on the request of @jtrmal (I think they are slower now because we use more regexes). Changed the preprocessing so case is not normalised and altered default behaviour to delete sentence-final '.' rather than convert to a token because it is more often the case that they are not spoken aloud.
…ased systems. Changed the scoring scripts in local/ to be similar to WSJ to get better analyses and changed the local/wer* scripts to fit this recipe.
… but not particular Danish characters. Corrected an error in a previous commit that changed the openfst version in tools/Makefile
Update from original
Contributor
Thanks... let's wait until the lexicon is available at openslr before
merging it.
In general we don't like to overwrite files at openslr if they have been
there a while, but this isn't a hard-and-fast rule.
Did you plan for the new lexicon to have the same filename, and what are
the differences from the old lexicon? I'm wondering whether we should give
it a different filename.
On Fri, Dec 2, 2016 at 9:36 AM, Andreas Søeborg Kirkedal wrote:
The update in this PR makes the modifications to sprakbanken that were
requested for sprakbanken_swe, makes the Python scripts work with Python
2.7.x, simplifies the recipe and gives better results. Because I have
changed the data preprocessing, a new lexicon needs to be uploaded to
openslr, but I cannot attach it to the PR.
Commit Summary
- Merge pull request #4 from kaldi-asr/master
- Made the same modifications to sprakbanken as @jtrmal suggested for
sprakbanken_swe and removed deprecated commands from run.sh
- Modified python scripts called by sprak_data_prep.sh so they work
with python 2 and 3 on the request of @jtrmal (I think they are slower now
because we use more regexes). Changed the preprocessing so case is not
normalised and altered default behaviour to delete sentence-final '.'
rather than convert to a token because it is more often the case that they
are not spoken aloud.
- Modified run.sh and tuned #leaves and #Gauss on the dev set for
GMM-based systems. Changed the scoring scripts in local/ to be similar to
WSJ to get better analyses and changed the local/wer* scripts to fit this
recipe.
- Modify the filters in local/wer_* so they remove accents and
umlauts, but not particular Danish characters. Corrected an error in a
previous commit that changed the openfst version in tools/Makefile
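The preprocessing change in the commit summary (case left as-is, sentence-final '.' deleted rather than turned into a token) could look roughly like the sketch below. This is not the code from normalize_transcript.py; the function name and the regular expression are hypothetical and only illustrate the described behaviour.

```python
import re

# Hypothetical pattern: a '.' at the very end of the line, with optional
# whitespace on either side. A '.' elsewhere in the line is left alone.
_FINAL_PERIOD = re.compile(r"\s*\.\s*$")


def strip_final_period(line):
    """Delete a sentence-final '.' and keep the case of every word."""
    return _FINAL_PERIOD.sub("", line)


if __name__ == "__main__":
    for example in ["Det er en Test.", "kl. 5 i dag", "Hej ."]:
        print(repr(strip_final_period(example)))
```

Only a trailing '.' is removed, so an abbreviation inside the sentence keeps its period, and no lowercasing is applied.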
File Changes
- *M* egs/sprakbanken/s5/local/copy_dict.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-0> (6)
- *M* egs/sprakbanken/s5/local/create_datasets.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-1> (2)
- *M* egs/sprakbanken/s5/local/dict_prep.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-2> (129)
- *M* egs/sprakbanken/s5/local/norm_dk/format_text.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-3> (11)
- *A* egs/sprakbanken/s5/local/norm_dk/numbersLow.tbl
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-4> (265)
- *M* egs/sprakbanken/s5/local/normalize_transcript.py
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-5> (17)
- *M* egs/sprakbanken/s5/local/normalize_transcript_prefixed.py
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-6> (30)
- *M* egs/sprakbanken/s5/local/score.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-7> (124)
- *M* egs/sprakbanken/s5/local/sprak_data_prep.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-8> (62)
- *A* egs/sprakbanken/s5/local/wer_hyp_filter
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-9> (5)
- *A* egs/sprakbanken/s5/local/wer_output_filter
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-10> (5)
- *A* egs/sprakbanken/s5/local/wer_ref_filter
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-11> (5)
- *M* egs/sprakbanken/s5/local/writenumbers.py
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-12> (1)
- *M* egs/sprakbanken/s5/run.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-13> (311)
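The commit summary describes the new local/wer_* filters as removing accents and umlauts while leaving particular Danish characters alone. The sketch below shows one way to get that behaviour in Python. It is not the content of those filter files, and it assumes that the protected characters are æ, ø and å; the function name is hypothetical.

```python
import unicodedata

# Assumption: these are the Danish letters the filters must not alter.
DANISH_KEEP = set("æøåÆØÅ")


def strip_accents_keep_danish(text):
    """Drop combining marks (accents, umlauts) except on protected letters."""
    # Compose first so that 'a' + combining ring is seen as a single 'å'.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in DANISH_KEEP:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(strip_accents_keep_danish("café Müller blåbær Søren"))
```

The explicit protection matters for å, which would otherwise decompose into a plain a plus a combining ring and lose the ring; æ and ø have no decomposition and pass through either way.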
Patch Links:
- https://github.com/kaldi-asr/kaldi/pull/1242.patch
- https://github.com/kaldi-asr/kaldi/pull/1242.diff
Contributor
Author
The words in the new lexicon are not case normalised. Otherwise, the old and new version are the same. I had thought to just replace the old lexicon with the new one, but if you would like to keep the old version, I can rename the new one to e.g. lexicon-da-nonorm.tgz
Contributor
Yes, please rename, and email Yenda separately with the new file.
Contributor
Lexicon published -- http://www.openslr.org/8/
Contributor
Andreas, let me know when the recipe is ready to check (e.g. the filename
matches the one in openslr).
Contributor
Author
Ready for review, Dan.
--
Kind regards
Andreas Søeborg Kirkedal
danpovey reviewed on Dec 11, 2016
dictdir=data/local/dict
espeakdir='espeak-1.48.04-source'
mkdir -p $dir
mkdir -p $dictsrc $dictd ir
Contributor
seems to be a space in the middle of a word.
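To see why the stray space matters, the sketch below uses Python's shlex module to split the reviewed line the way a POSIX shell would split it into words. It assumes the intended variable is $dictdir, which is the name assigned a few lines earlier in the same snippet; no variable expansion is performed here.

```python
import shlex

# Intended form, assuming the variable meant is $dictdir.
intended = shlex.split("mkdir -p $dictsrc $dictdir")

# Form as it appears in the reviewed diff, with the stray space.
as_written = shlex.split("mkdir -p $dictsrc $dictd ir")

if __name__ == "__main__":
    print(intended)
    print(as_written)
```

With the space, the shell sees a reference to $dictd, a different variable name, plus an extra argument ir, so mkdir would be asked to create a directory literally named ir instead of the dictionary directory.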
Contributor
There is a conflict; can you please merge and resolve it?
Merging to resolve conflict
dresen added a commit to dresen/kaldi that referenced this pull request on Dec 15, 2016:
Swedish changes (kaldi-asr#1242)